Analysis of single cell data

Data preprocessing and QC

Author

Jennifer Fransson, NBIS

Published

March 7, 2025

Here is some general information about the project and analysis. It can be edited in the file _metadata.yml.

Code
library(Seurat)
library(targets)
library(clustree)
library(dplyr)

tar_make_future()
Code
library(ggplot2)
library(cowplot)

source("R/themes.R")

Study description

This is the study description. It can be edited in the file _metadata.yml.

Original data

Code
tar_load(obj_orig_meta)
tar_load(qc_groupby)

Original cell counts per sample:

Code
knitr::kable(as.data.frame(table(obj_orig_meta[[tar_read(qc_groupby)]])), col.names = c("sample","no of cells"))
sample no of cells
Sample1 1500
Sample2 1500
Code
cellcounts = as.data.frame(table(obj_orig_meta[[tar_read(qc_groupby)]]))
cellcounts$Group = obj_orig_meta[,qc_groupby][match(cellcounts$Var1, obj_orig_meta[[tar_read(qc_groupby)]])]
ggplot(cellcounts, aes(x = Group, y = Freq)) + geom_jitter(width = 0.1) + lims(y = c(0,NA)) +
  labs(x = "", y = "Cell count")

Quality control and filtering

We use four variables to compare the quality of each sample: Number of unique genes (“nFeature_RNA”), number of UMI counts (“nCounts_RNA”), % of mitochondrial RNA in the sample (“percent.mito”) and % of ribosomal RNA in the sample (“percent.ribo”). The quality control metrics are shown for each sample before and after filtering. In the unfiltered data, the dashed lines indicate thresholds that were used to filter the data. The thresholds used were the following:

Code
tar_load(qc_filt_rules)



thresholds = data.frame(
  Parameter = sapply(qc_filt_rules, function(x){
  if(length(x)>3){
    x[[4]]
  }else{
    paste(x[[1]], x[[2]])
  }
}),
  Value = sapply(qc_filt_rules, "[[", 3))
Parameter Value
Min number of unique features per cell 2e+02
Max number of reads (unique UMIs) per cell 1e+06
Max % mitochondrial reads per cell 2e+01
Min % ribosomal reads per cell 5e+00
Code
tar_read(qc_unfiltered_vln)

After the first filtering, the distribution of QC variables are as shown:

Code
tar_read(qc_filtered_vln)

In addition to filtering genes based on the number of cells expressing the gene, individual genes can be removed for reasons such as expected technical artefacts. To identify genes that may introduce biases, we visualize the genes with the highest median expression across cells.

Code
tar_read(top_expr_boxplot)

The removed genes contain the following patterns:

Code
genes_to_remove = sapply(tar_read(qc_genesToRemove), 
                         function(g){
                           paste0("^",g,"$")
})
  
patterns = c(tar_read(qc_genePatternsToRemove), genes_to_remove)

if(length(patterns)>0){
  remove_df = 
    knitr::kable(data.frame(
  Pattern = gsub("\\^","", gsub("\\$","",patterns)),
  Rules = 
  sapply(patterns, function(x){
  rules = c()
  if(substr(x, 1, 1) == "^"){
    rules = c(rules, "Begins with")
  }
  if(substr(x, nchar(x), nchar(x)) == "$"){
    rules = c(rules, "Ends with")
  }
  paste0(rules, collapse = " & ")
})))
}else{
  remove_df = "*No genes were selected for removal*"
}
Pattern Rules
RP[SL] Begins with
MT- Begins with
MALAT1 Begins with & Ends with

Cell cycle prediction

In order to estimate whether there are differences between samples considering the cell cycle state at time of collection, and whether this needs to be accounted for downstream, “cell cycle scores” are added. This is an average of scaled expression of a list of genes indicative of the S phase and the G2-M phase in the cell cycle.

To specify which genes should be used for S phase and G2/M phase scores, edit the files parameters/cc_S_genes.csv and parameters/cc_G2M_genes.csv (note: these .csv-files should not contain column names). If no genes are indicated in these files, the default gene lists given by Seurat will be used (see Seurat::cc.genes.updated.2019$s.genes and Seurat::cc.genes.updated.2019$g2m.genes). Note that these default lists concern human genes.

Code
tar_read(cc_violin)

Doublet prediction

Code
DFmessage = "Doublet removal was not performed."

if(tar_read(qc_runDF)){
  DFmessage = 'In droplet-based sequencing, a number of droplets containing more than one cell is expected. In order to predict which "cells" in the dataset are actually composed of more than one cell, the package [DoubletFinder](https://github.com/chris-mcginnis-ucsf/DoubletFinder) is used. Doublet cells are then removed from the dataset.

To evaluate whether the doublet prediction worked well, we look at two plots:

1. A preliminary UMAP in which doublets are indicated (doublets are likely to be found between clusters)

2. A violin plot showing the distribution of number of unique features per cell, split by predicted doublets and singlets in each sample (on average, doublets have a higher number of unique genes)'
}

In droplet-based sequencing, a number of droplets containing more than one cell is expected. In order to predict which “cells” in the dataset are actually composed of more than one cell, the package DoubletFinder is used. Doublet cells are then removed from the dataset.

To evaluate whether the doublet prediction worked well, we look at two plots:

  1. A preliminary UMAP in which doublets are indicated (doublets are likely to be found between clusters)

  2. A violin plot showing the distribution of number of unique features per cell, split by predicted doublets and singlets in each sample (on average, doublets have a higher number of unique genes)

Code
if(tar_read(qc_runDF)){
  tar_read(doublet_plots)$dimplot & labs(title = "Doublets")
}

Code
if(tar_read(qc_runDF)){
  tar_read(doublet_plots)$vlnplot
}

Final cell counts

The final number of cells are as follows:

Code
tar_load(obj_filt_final_meta)
Code
knitr::kable(as.data.frame(table(obj_filt_final_meta[[tar_read(qc_groupby)]])), col.names = c("sample","no of cells"))
sample no of cells
Sample1 828
Sample2 552
Code
cellcounts = as.data.frame(table(obj_filt_final_meta[[tar_read(qc_groupby)]]))
cellcounts$Group = obj_filt_final_meta[,qc_groupby][match(cellcounts$Var1, obj_filt_final_meta[[tar_read(qc_groupby)]])]
ggplot(cellcounts, aes(x = Group, y = Freq)) + geom_jitter(width = 0.1) + lims(y = c(0,NA)) +
  labs(x = "", y = "Cell count")

The distributions of the QC variables are also shown.

Code
tar_read(qc_final_vln)

Normalization

Code
if(tar_read(dimred_sct)){
  normalization_method = paste0("Normalization was performed using the SCTransform method on the top ",
                                tar_read(dimred_nHVG)," most variable features.")
  if(length(tar_read(dimred_varstoregress))>0){
    normalization_method = paste0(normalization_method, " The following variables were regressed out: ",
                                  paste0(tar_read(dimred_varstoregress), collapse = ", "),".")
  }
}else{
  normalization_method = "Normalization was performed using the NormalizeData method."
  if(length(tar_read(dimred_varstoregress))>0){
    normalization_method = paste0(normalization_method, " The following variables were regressed out when scaling the top ",
                                  tar_read(dimred_nHVG)," most variable features: ",
                                  paste0(tar_read(dimred_varstoregress), collapse = ", "),".")
  }
}

if(!tar_read(load_joinlayers)){
  HVGmessage = "*Note: This analysis uses the feature of layers introduced in Seurat v5. A feature of this functionality is that variance for each gene is calculated on each original sample individually, and the variability for each gene is calculated based on the ranks in each sample. This means that unlike previous Seurat versions, the genes with the most variability in the plot are not necessarily the most variable when combining the ranked lists.*"
}else{
  HVGmessage = ""
}

Normalization was performed using the NormalizeData method. The following variables were regressed out when scaling the top 2000 most variable features: nFeature_RNA.

Dimensional reduction

Code
tar_load(obj_pca_umap)

PCA

Dimensional reduction was performed using PCA. To do this, the top 2000 most variable features were used. They are visualized below (in red), showing the mean expression and standardized variance for each gene. The top 20 genes are labeled.

Code
top20 <- head(VariableFeatures(obj_pca_umap), 20)

LabelPoints(plot = VariableFeaturePlot(obj_pca_umap), points = top20, repel = TRUE)

The top variable genes are used to produce an initial PCA.

Code
addColors(DimPlot(obj_pca_umap, reduction = "pca", shuffle = TRUE), layer = 1)

The top genes contributing positively or negatively to the first 4 principal components are as shown:

Code
VizDimLoadings(obj_pca_umap, dims = 1:4, nfeatures = 20, balanced = TRUE, ncol = 4) &
  geom_vline(xintercept = 0, linetype = 2, color = "grey") &
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

The variance explained by each PC helps us estimate how many PCs will be useful to include in order to retain maximum variance while reducong noise. An “elbow plot” is useful for this.

Code
ElbowPlot(obj_pca_umap, ndims = 50)

UMAP

A UMAP was produced to visualize the relative similarities between cells. For this, the following parameters were set:

Code
shortenvector = function(x){
  if(length(x)>1){
    if(is.numeric(x) | is.integer(x)){
      if(identical(as.numeric(x), as.numeric(min(x):max(x)))){
        x = paste0(min(x), ":", max(x))
      }
    }
    x = paste0(x, collapse = ", ")
  }
  x
}

umap_params = data.frame(
  Parameter = c("Number of neighbors",
                "Number of components to produce",
                "Minimum distance",
                "Number of PCs to consider"),
  Value = c(
    shortenvector(tar_read(dimred_umap_nn)),
    shortenvector(tar_read(dimred_umap_ncomp)),
    shortenvector(tar_read(dimred_umap_mindist)),
    shortenvector(tar_read(dimred_umap_dims))
    )
)
Parameter Value
Number of neighbors 30
Number of components to produce 2
Minimum distance 0.3
Number of PCs to consider 1:20
Code
addColors(DimPlot(obj_pca_umap, shuffle = TRUE), layer = 1)

Code
FeaturePlot(obj_pca_umap, c("nFeature_RNA","nCount_RNA","percent.mito", "percent.ribo","percent.hb"), ncol = 2) & 
  theme(plot.title = element_text(size = 12),
        legend.text = element_text(size =10),
        axis.text = element_blank(),
        axis.ticks = element_blank())

Code
rm(obj_pca_umap)
gc()

Integration

Code
if(tar_read(run_int)){
  intmessage = paste0("Integration was run to reduce batch effects considering each ",tar_read(int_split)," as a separate batch.")
  show_int = TRUE
}else{
  intmessage = "Integration between samples/batches was not run in this analysis."
  show_int = FALSE
}

Integration was run to reduce batch effects considering each Sample as a separate batch.

Code
tar_load(obj_int_umap)
Code
reductions = c(
  "pca_nonint","umap_nonint",
  ifelse(tar_read(load_joinlayers), "pca","integrated.cca"),
  "umap"
)

plot_grid(
  plotlist = 
    lapply(reductions, function(red){
      addColors(DimPlot(obj_int_umap, reduction = red, shuffle = TRUE) & 
        NoLegend() & 
        theme(axis.text = element_blank(), axis.title = element_blank(),
              axis.ticks = element_blank()) &
        labs(title = red), layer = 1)
}))

Code
addColors(DimPlot(obj_int_umap, reduction = "umap", shuffle = TRUE,
                  split.by = tar_read(int_split)), layer = 1)

Code
rm(obj_int_umap)
gc()

Clustering

Clustering was performed on the reduction pca, at resolutions 0.25, 0.5, 1, 1.5.

Code
tar_load(obj_clus)
Code
clustername = paste0("clusters_",tar_read(clus_reduction),"_")

clustering = colnames(obj_clus@meta.data)[grepl(clustername, colnames(obj_clus@meta.data))]
plot_grid(
  plotlist = 
    lapply(clustering, function(clus){
      shadow_text(addColors(DimPlot(obj_clus, reduction = "umap", shuffle = TRUE,
              group.by = clus, label = TRUE) & 
        NoLegend() & 
        theme(axis.text = element_blank(), axis.title = element_blank(),
              axis.ticks = element_blank()) &
        labs(title = clus),
        scale_type = "colour", layer = 1))
}))

Code
clustree(obj_clus, prefix = clustername)

Resolution for annotation

Code
cluster_res = paste0("clusters_",
                     tar_read(clus_reduction),
                     "_",
                     tar_read(clus_de_res))

For annotation purposes and comparison between samples, a clustering resolution was chosen. For this report, the cluster of interest was cluster_res.

Code
obj_clus$Cluster = factor(obj_clus@meta.data[[cluster_res]])

addColors(ggplot(obj_clus@meta.data, 
       aes(x = Sample)) + 
  geom_bar(aes(fill = Cluster),
           position = "fill"),
  "fill", layer = 1)

Code
addColors(ggplot(obj_clus@meta.data, 
       aes(x = Cluster)) + 
  geom_bar(aes(fill = Sample), position = "fill"),
  scale_type = "fill", layer = 1)

Code
tar_read(cluster_de_barplot)

Code
addColors(VlnPlot(obj_clus, 
        features = tar_read(qc_plotvars),
        group.by = "Cluster", 
        pt.size = 0, ncol = 2) & violintheme(),
        scale_type = "fill")

Code
addColors(VlnPlot(obj_clus, 
        features = c("S.Score","G2M.Score"),
        group.by = "Cluster", 
        pt.size = 0, ncol = 2) & violintheme(),
        scale_type = "fill")

Code
top3 = unique(tar_read(cluster_de)$de_res %>%
  group_by(cluster) %>%
  slice_min(order_by = p_val_adj, n = 3) %>%
    pull(gene))


addColors(VlnPlot(obj_clus, group.by = "Cluster",
        features = top3, pt.size = 0, assay = "RNA") &
  violintheme(), scale_type = "fill")

Workflow warnings

Code
knitr::kable(targets::tar_meta(fields = warnings, complete_only = TRUE))
name warnings
cc_violin Subsetting quosures with is deprecated as of rlang 0.4.0Please use quo_get_expr instead.This warning is displayed once every 8 hours.
doublet_plots The default method for RunUMAP has changed from calling Python UMAP via reticulate to the Rnative UWOT using the cosine metricTo use Python UMAP via reticulate, set umap.method to umaplearn and metric to correlationThis message will be shown once per session
metadata incomplete final line found by readTableHeader on datametadata.csv
obj_cc The following features are not present in the object RAD51, not searching for symbol synonyms. The following features are not present in the object PIMREG, CDC25C, not searching for symbol synonyms
obj_df The default method for RunUMAP has changed from calling Python UMAP via reticulate to the Rnative UWOT using the cosine metricTo use Python UMAP via reticulate, set umap.method to umaplearn and metric to correlationThis message will be shown once per session. UNRELIABLE VALUE One of the future.apply iterations future_lapply1 unexpectedly generated random numbers without declaring so. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To fix this, specify future.seedTRUE. This ensures that proper, parallelsafe random numbers are produced via the LEcuyerCMRG method. To disable this check, use future.seed NULL, or set option future.rng.onMisuse to ignore.
obj_filt No layers found matching search pattern provided. data layer is not found and counts layer is used
obj_int Different features in new layer data than already exists for scale.data. Different features in new layer data than already exists for scale.data. Layer counts isnt present in the assay object returning NULL
obj_int_umap The default method for RunUMAP has changed from calling Python UMAP via reticulate to the Rnative UWOT using the cosine metricTo use Python UMAP via reticulate, set umap.method to umaplearn and metric to correlationThis message will be shown once per session
obj_pca_umap The default method for RunUMAP has changed from calling Python UMAP via reticulate to the Rnative UWOT using the cosine metricTo use Python UMAP via reticulate, set umap.method to umaplearn and metric to correlationThis message will be shown once per session

Session info

Code
sessionInfo()
R version 4.3.3 (2024-02-29)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Big Sur ... 10.16

Matrix products: default
BLAS/LAPACK: /Users/jenfra/miniconda3/envs/seurat5/lib/libopenblasp-r0.3.28.dylib;  LAPACK version 3.12.0

locale:
[1] C/UTF-8/C/C/C/C

time zone: Europe/Stockholm
tzcode source: system (macOS)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] shadowtext_0.1.4   cowplot_1.1.3      dplyr_1.1.4        clustree_0.5.1    
 [5] ggraph_2.2.1       ggplot2_3.5.1      targets_1.9.1      Seurat_5.1.0      
 [9] SeuratObject_5.0.2 sp_2.1-4          

loaded via a namespace (and not attached):
  [1] RColorBrewer_1.1-3     jsonlite_1.8.9         magrittr_2.0.3        
  [4] ggbeeswarm_0.7.2       spatstat.utils_3.1-0   farver_2.1.2          
  [7] rmarkdown_2.28         vctrs_0.6.5            ROCR_1.0-11           
 [10] memoise_2.0.1          spatstat.explore_3.2-6 htmltools_0.5.8.1     
 [13] sctransform_0.4.1      parallelly_1.38.0      KernSmooth_2.23-24    
 [16] htmlwidgets_1.6.4      ica_1.0-3              plyr_1.8.9            
 [19] cachem_1.1.0           plotly_4.10.4          zoo_1.8-12            
 [22] igraph_2.0.3           mime_0.12              lifecycle_1.0.4       
 [25] pkgconfig_2.0.3        Matrix_1.6-5           R6_2.5.1              
 [28] fastmap_1.2.0          fitdistrplus_1.2-1     future_1.34.0         
 [31] shiny_1.9.1            digest_0.6.37          colorspace_2.1-1      
 [34] patchwork_1.3.0        ps_1.8.0               tensor_1.5            
 [37] RSpectra_0.16-2        irlba_2.3.5.1          base64url_1.4         
 [40] labeling_0.4.3         progressr_0.14.0       fansi_1.0.6           
 [43] spatstat.sparse_3.1-0  httr_1.4.7             polyclip_1.10-7       
 [46] abind_1.4-5            compiler_4.3.3         withr_3.0.1           
 [49] backports_1.5.0        viridis_0.6.5          fastDummies_1.7.4     
 [52] ggforce_0.4.2          MASS_7.3-60.0.1        tools_4.3.3           
 [55] vipor_0.4.7            lmtest_0.9-40          beeswarm_0.4.0        
 [58] httpuv_1.6.15          future.apply_1.11.2    goftest_1.2-3         
 [61] glue_1.8.0             callr_3.7.6            nlme_3.1-165          
 [64] promises_1.3.0         grid_4.3.3             checkmate_2.3.2       
 [67] Rtsne_0.17             cluster_2.1.6          reshape2_1.4.4        
 [70] generics_0.1.3         gtable_0.3.5           spatstat.data_3.1-2   
 [73] tidyr_1.3.1            data.table_1.15.4      tidygraph_1.3.0       
 [76] utf8_1.2.4             spatstat.geom_3.2-9    RcppAnnoy_0.0.22      
 [79] ggrepel_0.9.6          RANN_2.6.2             pillar_1.9.0          
 [82] stringr_1.5.1          spam_2.11-0            RcppHNSW_0.6.0        
 [85] later_1.3.2            splines_4.3.3          tweenr_2.0.3          
 [88] lattice_0.22-6         survival_3.7-0         deldir_2.0-4          
 [91] tidyselect_1.2.1       miniUI_0.1.1.1         pbapply_1.7-2         
 [94] knitr_1.48             gridExtra_2.3          scattermore_1.2       
 [97] xfun_0.48              graphlayouts_1.2.0     matrixStats_1.4.1     
[100] stringi_1.8.4          lazyeval_0.2.2         yaml_2.3.10           
[103] evaluate_1.0.1         codetools_0.2-20       tibble_3.2.1          
[106] cli_3.6.3              uwot_0.1.16            secretbase_1.0.3      
[109] xtable_1.8-4           reticulate_1.39.0      munsell_0.5.1         
[112] processx_3.8.4         Rcpp_1.0.13            globals_0.16.3        
[115] spatstat.random_3.2-3  png_0.1-8              ggrastr_1.0.2         
[118] parallel_4.3.3         dotCall64_1.2          listenv_0.9.1         
[121] viridisLite_0.4.2      scales_1.3.0           ggridges_0.5.6        
[124] leiden_0.4.3.1         purrr_1.0.2            rlang_1.1.4